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ABSTRACT 

Using a Bayesian approach, we consider the problem of re¬ 
covering sparse signals under additive sparse and dense noise. 
Typically, sparse noise models outliers, impulse bursts or data 
loss. To handle sparse noise, existing methods simultane¬ 
ously estimate the sparse signal of interest and the sparse 
noise of no interest. For estimating the sparse signal, without 
the need of estimating the sparse noise, we construct a robust 
Relevance Vector Machine (RVM). In the RVM, sparse noise 
and ever present dense noise are treated through a combined 
noise model. The precision of combined noise is modeled 
by a diagonal matrix. We show that the new RVM update 
equations correspond to a non-symmetric sparsity inducing 
cost function. Further, the combined modeling is found to be 
computationally more efficient. We also extend the method 
to block-sparse signals and noise with known and unknown 
block structures. Through simulations, we show the perfor¬ 
mance and computation efficiency of the new RVM in sev¬ 
eral applications; recovery of sparse and block sparse signals, 
housing price prediction and image denoising. 

1. INTRODUCTION 

Noise modeling has an important role in the Bayesian infer¬ 
ence setup to achieve better robustness and accuracy. Typ¬ 
ically noise is considered to be additive and dense (possibly 
even white) in nature. In this paper we investigate the effect of 
sparse noise modeling in a standard Bayesian inference tool 
called the Relevance Vector Machine (RVM) |[T]. 

The RVM is a Bayesian sparse kernel technique for ap¬ 
plications in regression and classification m. Interest in the 
RVM can be attributed to the cause that it shares many char¬ 
acteristics of the popular support vector machine whilst pro¬ 
viding Bayesian advantages |l2l[3]|4l, mainly providing poste¬ 
riors for the object of interest. Generally the RVM is a fully 
Bayesian technique that aims to learn all the relevant system 
parameters iteratively to infer the object of interest. In a lin¬ 
ear model setup used for regression, RVM introduces spar- 
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sity through a weight vector where the weights are essential 
to form linear combinations of relevant kernels to predict the 
object of interest; the weight vector is a set of system pa¬ 
rameters and its sparsity leads to reduction of model com¬ 
plexity for regression. Naturally, the RVM has been further 
used for sparse representation techniques as well as develop¬ 
ing Bayesian compressive sensing methods Q. 

For a Bayesian linear model, the standard RVM uses a 
multivariate isotropic Gaussian prior to model the additive 
dense noise. Here isotropic means that the associated co- 
variance matrix is proportional to the identity matrix. Such 
a dense noise model has inherent limitations to accommodate 
instances of outliers IS El El 121 [TOl, impulse bursts |[TT1 [121 
or missing (lost) data KEllIl. We hypothesize that a sparse 
and dense noise model can accommodate for the statistics of 
a variety of noise types, without causing degradation in per¬ 
formance for any noise type compared to the standard case 
of using only a dense noise model. In this paper, we develop 
RVM for such a combined (joint) sparse and dense noise sce¬ 
nario. 

1.1. System model 

We consider the following linear system model 

y = Ax-I-e-f n, (1) 

where y G K™ is the measurements, x G M” is a sparse vec¬ 
tor (for example weights in regression or sparse signal to esti¬ 
mate in compressed sensing), A G is a known system 

matrix (for example, regressors or sampling system). Further, 
e G is sparse noise and n G is dense noise. Using 
fo-norm notation to represent the number of non-zeros in a 
vector, we assume that ||x||o ^ n and ||e||o ^ m are small 
and unknown. The random vectors x, e and n are indepen¬ 
dent. The model Q is used in face recognition lITSl . image 
denoising 0 and compressed sensing ||5l. 

1.2. Our contribution 

We develop a RVM for the model Q, by treating e -f n as 
a combined noise. By learning parameters of x and e -f n. 



we estimate x without the need of estimating e. We also con¬ 
sider the scenario where the signal x and noise e are block 
sparse. By using techniques similar to the ones in m we 
generalize the methods to signals with unknown block struc¬ 
ture. The main technical contribution is to derive update equa¬ 
tions that are used iteratively for estimation of parameters in 
the new RVM. We refer to the new RVM as the RVM for com¬ 
bined sparse and dense noise (SD-RVM). By an approximate 
analysis, the SD-RVM algorithm is shown to be equivalent to 
the minimization of a non-symmetric sparsity inducing cost 
function. Finally, the performance of SD-RVM is evaluated 
numerically using examples from compressed sensing, block 
sparse signal recovery, house price prediction and image de- 
noising. Throughout the paper, we take an approach of com¬ 
paring SD-RVM vis-a-vis the existing Robust Bayesian RVM 
(RB-RVM) ||6l (described in the next section). 


distribution p(y|7, i/,/ 3 ) becomes equivalent to maximiz¬ 
ing the joint distribution p(y, 7, iz, ^), in the limit of non- 
informative priors. In calculations, however, the parameters 
(a, b, c, d) are often given small values to avoid numerical 
instabilities. To estimate [x^ RB-RVM fixes the 

precisions and sets 

[x^ e^]^=/ 3 SflB[A IjTy, ( 4 ) 

^^«=((o N ’ 

where r = diag(7i,72,... ,7„) andN = (liSig{vi,V2, 

The RB-RVM iteratively updates the precisions by maximiz¬ 
ing p{y, 7, zz, /?), resulting in the update equations 


1.3. Prior work 


To establish relevance of our work we briefly describe prior 
work in this section. Almost all prior works naiTiisiii trans¬ 
late the linear setup Q to the equivalent setup 


y = [ A i„ 


X 

e 


+ n, 


( 2 ) 


where 1 ^ is the mxm identity matrix, [A Im ] acts as the 

effective system matrix and [x^ e^] ^ acts as the parameter 
vector to be estimated. The RB-RVM of 0 uses the standard 
RVM approach for (|^ directly. Hence RB-RVM learns model 
parameters for all three signals x = [xi, X2, ■ ■ ■, Xn]^, e = 
[ci, €2, ... ,em]^ and n, and thus estimates both x and e 
jointly. RB-RVM assumes Gaussian priors 

n m 

X - J|7V(0,7,"^), e - ]Ja/'(0,zz,"^), n - ^(0, 

i=l 

where the precisions (inverse variances) a^, zz* and /3 are un¬ 
known. The precisions are given Gamma priors 

p(7i) = Gamma(7i|a +l,b), ( 3 ) 

piixi) = Gamma(zzi|o -I- l,b), 
p{P) = Gamma(/ 3 |c -I- 1 , d), 

where Gamma(7i|a + 1 , 6 ) oc ID. Typical practice 

is to maximize p(y|7, v, ( 3 ) to infer the precisions, where we 
used boldface symbols to denote the vectors 

IX = [zZl,ZZ2,...,ZZm]^. 


1 ^new _ ^ Ri[^RB]n+i,n+i 

-n-—+12-> (5) 

||y- Ax- e||| 
where denotes the {i,i) component of the matrix 

^RB- 

The update equations Q and Q are found by applying 
the standard RVM to (|^. Derivations can be found in e.g. ID 
0. Iterating until convergence gives the final estimates x and 
e. In the iterations, some precisions become large, making 
their respective components in x and e close to zero. This 
makes the final estimate of x and e sparse. 

RVM has high similarity with Sparse Bayesian Learning 
(SBL) EliniElll. Sparse Bayesian learning has been used 
for structured sparse signals, for example block sparse signals 
Qa, where the problem of unknown signal block structure 
was treated using overlapping blocks. The model extension of 
RB-RVM shown in (|^ for handling block sparse noise with 
unknown block structure is straight-forward to derive. How¬ 
ever, in our formulation, as we are not estimating the noise 
explicitly, the use of block sparse noise with unknown block 
structure is non-trivial. 

Further, non Bayesian (even not statistical) methods have 
been used for sparse estimation problems lIl[Il[T9l|20l. For 
example, the £i-norm minimization method justice pursuit 
(IP) Q uses the optimization technique of the standard ba¬ 
sis pursuit denoising method ca, as follows 

x,e = argmin ||x||i + ||e||i s.t. ||y - Ax - e||2 < e, 

x,e 

(6) 


new 

li = 


j^new _ 


Instead we take the alternative (full Bayesian) approach of 
maximizing p(y, 7, zz, / 3 ) and assume that precisions have 
non-informative prior by taking the limit (a, 6, c, d) —> 0 . For 
the distributions considered here, maximizing the conditional 


where e > 0 is a model parameter set by the user. For un¬ 
known noise power, it is impossible to know e a-priori. We 
mention that a fully Bayesian setup like the RVM does not 
require parameters set by a user. 







2. RVM FOR COMBINED SPARSE AND DENSE 
NOISE (SD-RVM) 

2.1. SD-RVM Method 

For Q, we propose to use a combined model for the two ad¬ 
ditive noises, as follows 


and the determinant lemma ETIl 

det(B-i + AE-^A^) = det(S-i)det(r'^)det(B-i). 

( 12 ) 

Using ( [TT] i and ( |T^ we hnd that £ is maximized w.r.t. 7 ^ 
when 


e-f n-A((0,B-i), (7) 

where B = diag(/3i, / 32 , • ■ ■, Pm)- We also use /3 to denote 
the vector /3 = [/3i, / 32 , ■ ■ ■, That means the two noises 

are treated as a single combined noise where each noise com¬ 
ponent has its own precision. The rationale is that we do not 
need to seperate the two noises. Although our model pro¬ 
motes sparsity in the noise we empirically hnd that it is able 
to model sparse and non-sparse noise. Using the noise model 
0 and that x ~ nr=i-^(^’ 7 'i ^)’ maximum a 

posteriori (MAP) estimate 

x= SA^By, 

s = (r +aTba)”\ 



1 




(13) 


Instead of solving for 7 ^ (which would require solving a non¬ 
linear coupled equation since S and x depend on 7 ^) we ap¬ 
proximate the equation as 


1 - 7 ,E,, + 2a - (^2 + 26)7^“'= 0. (14) 


We solve ( [T4l l for 7 ™®’" rather than ( [32l l for 7 ^ since it in prac¬ 
tice often results in a better convergence Ellll. The update 
equation then becomes 


7i 


1 7z£zi “t“ 2a 

3:1 +2b 


where as before F = diag( 7 i, 72 ,... , 7 „). The precisions Setting a = 5 = 0 we obtain 0 . 
are updated as For the noise precisions we use that 


new _ 1 li^^i 

7i ^2 


^new ^ 


1-/?,-[ASAT]^,- 

[y-Ax]2 ^ 


( 8 ) 

(9) 


where The derivations of 0 and (|^ are given in 

the next section. 


2.2. Derivation of update equations for SD-RVM 


^ [y^(B-i + Ar-^A^)-^y] = [y - Ax]J. (15) 
We hnd that C is maximized w.r.t. Pj when 

where Aj ^ denotes the j’th row vector of A. Rewriting the 
equation as 


To update the precisions we maximize the distributionp(y, 7 ,/ 3 ) = 1 — p^Aj^-SAj. + 2c — ([y — Ax]| -f 2d)P'^™ = 0, 
F(y| 7 ) lP)p{l)p{l3p (obtained by marginalizing over x), with 

respect to 7 ^ and Pj, where we use the prior using that A^ :SAj. = [ASA^J^j, we hnd that 


P{Pj) = Gamma(/3j|c-|- 

and p( 7 i) is as in ([^. The log-likelihood of the parameters is 
C =constant — ^ log det(B~^ + AF^^A^) (10) 

-^yT(B^i+AF-iAT)-V 

n m 

+ ^(alog7i - bji) + ^(clog/3j - dPj). 

i=i j=i 


We maximize C w.r.t. 7 ^ by setting the derivative to zero. 
To simplify the derivative we use that 

^ (y^(B-i -t- AT-^A^)-^y) = xj, ( 11 ) 


pnew ^ 


l-PjlAllA^ 


2c 


[y - Ax ]2 + 2d 


Setting c = d — 0 we obtain (|9 1 . 

The derivations of ( [TT] ) and (are given in Appendix|6.1| 


2.3. Analysis of sparsity 

Several approximations are made in the derivation of the iter¬ 
ative update equations. It is interesting to see how the approx¬ 
imations affect the sparsity of the solution. In this subsection, 
we show that the approximations make the SD-RVM equiva¬ 
lent to minimizing a non-symmetric sparsity promoting cost 
function. 

To motivate that the standard RVM is sparsity promoting, 
one can use that the marginal distribution of Xi is a student-t 







distribution. For a fixed /3 (and e = 0), the standard RVM is 
therefore an iterative method for solving (details can be found 
in 11 ) 


R ^ 

mm |||y - Ax ||2 + (l + ^) 5 ]log(x? + 26). 

1=1 

The log-sum cost function can be used as a sparsity promot¬ 
ing cost function, making it plausible that the RVM promotes 
sparsity. 

For the SD-RVM, the precisions are updated by maxi¬ 
mizing the marginal distribution p(y,7,/3). The problem is 
equivalent to maximizing C in ( |3l . We show approximations 
for relevant parts of the right hand side of C as follows 

logdet(S-i) « logdet((S°'‘^)-i)-f 

n m 

Y, - 7f") + (16) 

i=i j=i 

where the approximation is up to first order in 7 and /3. We 
rewrite the problem in variables x and e using that UTij 

n m 

yT(Ar”^A^ =min^7*a;? + 

x.e 

(17) 

such that Ax -f e = y 

where now e = e -f n as in (|li. Under the approximation ( [T6] l 
and the reformulation maximization of logp(y, 7 ,/ 3 ) 
becomes equivalent to 

n 

+ (1 + 2a) log(7i)] 

2=1 

m 

+ "Y + ‘2d)/3j -f (1 -f 2c) log(/3j) . 

1=1 

such that Ax -f e = y 

By minimizing the objective with respect to 7 ^ and /3j, the 
problem reduces to 


min (1 -f 2 a) V log(a ;2 -f -f 26) (18) 

X.e 

2=1 

m 

+ (1 + 2 c) Y log(e| + + 2d), 

1=1 

such that Ax -f e = y 

where we have ignored additive constants. Because of the 
approximations, the constants T,°-^ and make 

the cost function non-symmetric in the components of x and 
e. The SD-RVM is thus equivalent to minimizing a non- 
symmetric sparsity promoting cost function. In a similar way 



Fig. 1. The non-symmetric log-ball log(a:f -1-0.02) -|-log(x| -f 
0.1) < 0. SD-RVM is equivalent to finding the small¬ 
est non-symmetric log-ball that intersects the linear subspace 
Ax -f e = y. 

it can be shown that the standard RVM and RB-RVM are 
also equivalent to non-symmetric cost functions under ap¬ 
propriate approximations. A two-dimensional example using 
X = [xi, ^ 2 ]^ is shown in Fig. 

2.4. Computational complexity 

In this section we take a non-rigorous approach for quanti¬ 
fying the computational complexity of SD-RVM. The com¬ 
plexity is computed per iteration, since the number of iter¬ 
ations depends on the stopping criterion used, and with the 
assumption of a naive implementation. Each iteration of SD- 
RVM requires 0{n^) flops to compute the matrix S using 
Gauss-Jordan elimination ll24l . Updating the precisions re¬ 
quires 0{nm) flops since the residual y — Ax needs to be 
computed. Hence the computational complexity of SD-RVM 
is 


0(max(nTO, n^)) = 0(n ■ max(m,n^)). 

A natural interest is the complexity of RB-RVM. Again with 
the assumption of a naive implementation, each iteration of 
RB-RVM requires the inversion of a (n-hm) x (n-fm) matrix 
to compute Sub- Updating the precisions requires 0{nm) 
flops and hence the computational complexity of RB-RVM is 

C>(max(nm, (n -f m)^)) = 0{{n m)^). 

In Section [4T| we provide numerical evaluations to quantify 
algorithm run time requirements that confirm that SD-RVM 
is typically faster than RB-RVM. 

3. SD-RVM WITH BLOCK STRUCTURE 

3.1. SD-RVM for known block structure 

To describe a block sparse signal x G K" with known block 
structure we partition [n] = { 1 , 2 ,..., n} into blocks as 

[n] = Ji U /2 U • • • U Ip, 




where \Ii\ = rii and li Olj =0 for i ^ j. The signal is block 
sparse when only a few blocks of the signal are non-zero. The 
component-wise SD-RVM generalizes to this scenario by re¬ 
quiring that the precisions are equal in each block, i.e. we 
choose the prior distribution for the components of block 
to be 

X/. ~ 

where xj. G denotes the vector consisting of the compo¬ 
nents of X with indices in . 

Similarly we can partition the components of the sparse 
noise e G K™ into blocks as 

[to] = Jl U J2 U ■ ■ ■ U Jq, 

where | J^j = mj, Jj H Ji = ^ for i ^ j and the block Jj of 
e is given the prior distribution 

As before, the precisions are given gamma distributions ([^ as 
priors, where now 




Fig. 2. Illustration of non-overlapping and overlapping block 
parameterizations. 


p{f3j) = Gamma(/?j|c-|- l,d). (19) 


Using this model, we derive the update equations of preci¬ 
sions as below 


rii - 7 itr(S/J -I- 2a 
l|x/Jli + 2& ’ 

rrij - Pj tr([ASA^] ) + 2c 
||(y-Ax)jJ|2 + 2d 


( 20 ) 

( 21 ) 


where S/. denotes the rii x n-i submatrix of S formed by 
elements appropriately indexed by By setting li = {i} 
and Jj = {j} we obtain the update equations for component¬ 
wise sparse signal and noise. We see that when li = {i} and 
Jj = J = [to], then ( |2l] i reduces to the update equations of 
the standard RVM since 


n- 13 tr(ASA^) = y^ -fiJ^u. 

i 

The derivation of the update equations ( |20l i and is 
found in Appendix |6. 2 1 


3.2. SD-RVM for unknown block structure 

In some situations the signal can have an unknown block 
structure, i.e. the signal is block sparse, but the dimensions 
and positions of the blocks are unknown. This scenario can 
be handled by treating the signal as a superposition of block 
sparse signals ifTbll (see illustration in Figure [^. This ap¬ 
proach also describes the scenario Q when e is component 
wise sparse and n is dense (e.g. Gaussian). The precision 
of each component is then a combination of the precisions 


of the blocks to which the component belongs. Let 7 ^ be 
the precision of the component Xi and 7 ^ be the precision of 
block Ik- We model the signal as 

X, - A/'( 0 , 7 ,"^), 

= E Tfc-'- (22) 

We model the noise in a similar way with precisions /3j for 
component j and precisions Pi for the block with support J;. 
To promote sparsity, the precisions of the underlying blocks 
are given gamma distributions as priors 

7 fe ~ Gamma( 7 fcla + 1 , 6 ), 

Pi ^ Gamma(/3i j c + 1, d). 

In each iteration we update the underlying precisions 7 ^. The 
componentwise precisions are then updated using ( |2^ . With 
this model, the update equations for the precisions become 

^ ^tr(rfc)-^tr(rfcSrfc) + 2 a 

4||r,xii2-f26 

itr(B 0 - Ut{BiA^'EABi) + 2 c 

onew _ Pi _Pi_ 

' " illBz(y-Ax)lli + 2 d 

where is the diagonal matrix with [rfc]^^ = ^i'lf i G Ik and 
[rfc]ii = 0 otherwise. We denote the corresponding matrix for 
Pi by B;. The componentwise precisions are updated using 
( [ 22 I 1 and similar for Pj. 

We see that when the underlying blocks are disjoint, then 
li = Ik for all I G Ik and Pj = Pi for all j G Ji- The 


(23) 

(24) 









































update equations then reduce to the update equations ( |2T] l for 
the block sparse model with known block structure. 

3.2.1. Sparse and dense noise 

In the model ([T]i where x and e are componentwise sparse and 
n is dense, then 

h = = {j}> Jm+i = N, (25) 

where i = 1,2,... ,n and j = 1,2,... ,m. In this scenario 
the support set of the sparse and dense noise is overlapping, 
so the update equations for the precisions become 


new _ 1 + 2a 

li ~ .'-.2 , o ;. ’ 


^new ^ 


l-a[ATSA] 


33 


■2c 


/3j[y- Ax]2 + 2d 


i = 1,2, 


anew _ 

Pm+1 — ' 


ET=i /3. - 


Er=i/5|[A^SH,+2c 


Er=i/3 ?[y-A*] 


2d 


Pj — ^ + /3m+l) 


We will use these update equations in the simulations where 
the signal is component-wise sparse and the noise is a sum 
of (component-wise) sparse and dense noise. It turns out that 


this method is slightly better than the SD-RVM in section 2.1 


4. SIMULATION EXPERIMENTS 

In this section we evaluate the performance of the SD-RVM 
using several scenarios - for simulated and real signals. For 
simulated signals, we considered the sparse and block sparse 
recovery problem in compressed sensing. Then for real sig¬ 
nals, we considered prediction of house prices using the 
Boston housing dataset ||25]| and denoising of images con¬ 
taminated by salt and pepper noise. In the simulations we 
used the cvx toolbox ll26l to implement JR 


4.1. Compressed sensing 

The recovery problem in compressed sensing consists of esti¬ 
mating a sparse vector x € K" in ([T]) from m linear measure¬ 
ments, where m <C n. To evaluate the performance of the 
algorithms, we generated measurement matrices A G 
by drawing their components from a A/^(0,1) distribution and 
scaling their column vectors to unit norm. We selected the po¬ 
sitions of the active components of x and e uniformly at ran¬ 
dom and draw their values from A/^(0,1). In the simulation 
we draw the additive noise n from A/^(0, We com¬ 

pared JP, the standard RVM, RB-RVM and SD-RVM. For JP 
0' we assumed a„ to be known and set e = a„ \/m + 2s/2m 
as proposed in [na. 



Fig. 3. NMSE vs. m/n for outlier free measurements. 



Fig. 4. NMSE vs. m/n for 5% outliers contaminated mea¬ 
surements. 


In the simulations we varied the measurement rate m/n 
(ratio of the number of measurements and the signal dimen¬ 
sion) for measurements without outliers and with 5% outliers. 
We chose n — 100 and fixed the signal-to-dense-noise-ratio 
(SDNR) 

SDNR = E[\\A^\\l]/E[\\n\\l] = \W{mal), 

to 20 dB. By generating 100 measurement matrices and 100 
vectors x and e for each matrix we numerically evaluated the 
Normalized Mean Square Error (NMSE) 

nmse = e;[||x-x||2]/f;[||x||2]. 

The results are shown in Eigurej^and Eigurej^ We found that 
SD-RVM outperformed the other methods. The improvement 
of SD-RVM over RB-RVM was 1 to 1.5 dB for m/n > 0.5, 
with and without outliers. Compared to JP, the improvement 
of SD-RVM was 3 to 3.7 without outlier noise and 1 to 4 
dB with outlier noise when m/n > 0.5. The poor perfor- 
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Fig. 5. Histogram of cputimes for the compressed sensing 
problem. Time is in seconds. 


mance of RVM is due to sensitivity to the regularization pa¬ 
rameters. The performance of RVM improves greatly when 
the regularization are optimally tuned, however, the optimal 
values varies with SNR and measurement dimensions. The 
experiments show that the performance of SD-RVM does not 
degrade in the absence of sparse noise. 

For each realization of the problem we measured the run¬ 
time (cpu time) of each algorithm. The histogram of the run¬ 
times is shown in figure We found that the runtimes of 
the RVM algotithms (the standard RVM, RB-RVM and SD- 
RVM) were shorter than the runtime of JP and the runtimes of 
JP were spread over a larger range. Of the RVM algorithms, 
SD-RVM had the highest concentration of low runtime (< 0.2 
seconds), while the runtimes of the standard RVM and RB- 
RVM was more concentrated around 0.2 seconds. The his¬ 
togram in figure l^has been truncated to only show percentage 
for the visible values . 

4.2. Block sparse signals 

The recovery problem in compressed sensing can be general¬ 
ized to block sparse signals and noise l28l . For block sparse 
signals, the signal components are partitioned into blocks of 
which only a few blocks are non-zero. Sparse Bayesian learn¬ 
ing (SBL) extended to the block sparse signal case is often 
referred to as block SBL (BSBL) ifThl fTTl . The problem of 
unknown block structure can be solved by overparametrizing 
the blocks Uhl. In BSBL Uhl, the signal is modelled as 



Fig. 6. NMSE vs. m/n for signals with known block structure 
and 5% outliers in measurements. 



Fig. 7. NMSE vs. m/n for signals with unknown block struc¬ 
ture and 5% outliers in measurements. 


The resulting problem can then be solved using BSBL for 
known block structure. When the minimum block size Kmin 
is known, the summation can be restricted to subsets of size 

|/| = na. 

The SD-RVM can be extended to the block sparse case 


using the methods developed in Section 3.1 and Section 3.2 


Justice Pursuit can be extended to the block sparse case in a 
similar way as BSBL by setting 


X, e = arg mm 


El 




Eii^l 


| 2 ) 


(26) 


such that I |y — Ax — e| I 2 < e 


Ax= ^ A/x/, 

IC[n] 

i.e. the measured signal is modelled as a sum of signals 
where each signal represents a block of the original signal. 


where the sum runs over all blocks (non-overlapping or over¬ 
lapping) and as before we assume the noise variance to be 
known and set e = cr„\/m -f VSm. Eor unknown block 
structure we also compared with component sparse methods 
RVM, RB-RVM and SD-RVM. 


























To numerically evaluate the performance of the block 
sparse algorithms we varied the measurement rate m/n for 
measurements with 5% sparse noise. We set the signal dimen¬ 
sion to n = 100 and hxed the SDNR to 20 dB. We divided the 
signal X into 20 blocks of equal size of which 3 blocks were 
non-zero. The sparse noise consisted of blocks with 5 com¬ 
ponents in each block. In the sparse noise, 5% of the blocks 
were active. For known block structure, the blocks were 
choosen uniformly at random from a set of predehned and 
non-overlapping blocks while for unknown block structure, 
the first component of each block was choosen uniformly at 
random, making it possible for the blocks to overlap. The 
active components of the signal and the sparse noise were 
drawn from N{G, 1). By generating 50 measurement matri¬ 
ces A and 50 signals x and sparse noises e for each matrix 
we numerically evaluated the NMSE. 

For known block structure we found that block SD-RVM 
outperformed the other methods. The NMSE of the block 
sparse SD-RVM was lower than the NMSE of block JP by 
more than 10 dB for m/n = 0.3, 0.4 and from 2 to 6 dB 
lower for m/n > 0.5. The results are presented in Eigure|^ 

Eor unknown block structure we found that for m/n < 
0.5, SD-RVM for unknown block structure gave best perfor¬ 
mance while for m/n > 0.5, the usual component sparse 
SD-RVM gave the best perfomance. The NMSE of JP for un¬ 
known block structure was about 5 dB larger than the NMSE 
of block SD-RVM for 0.4 < m/n < 0.6, while for m/n > 
0.9 block JP gave a better NMSE than block SD-RVM. As 
expected, RVM and BSBL gave poor performance since they 
are not developed to handle measurements with sparse noise. 
The results are shown in hgure]^ 

4.3. House price prediction 

One real world problem is the prediction of house prices. To 
test the algorithms on real data, we used the Boston housing 
dataset ll25l . The dataset consists of 506 house prices in sub¬ 
urbs of Boston and 13 parameters (air quality, accessibility, 
pupil-to-teacher ratio, etc.) for each house. The problem is 
to predict the median house price for part of the dataset (test 
data) using the complement dataset (training data) to learn re¬ 
gression parameters. We model the house prices as 

Pi = vf] yi + ni + Ci, 

where pi is the price of house i, G contains the pa¬ 
rameters of house i, X G is the regression vector, is 
(Gaussian) noise and is a (possible) outlier. Very expensive 
or inexpensive houses can treated as outliers. The goal is to 
estimate the median house price for the test set. We find the 
median by estimating the regression parameters and setting 

m = median(W^x), 

where W contains the parameters of the houses in the test 
set. It is believed that only a few parameters are important to 


Table 1. Prediction of median houseprice using the Boston 
Housing dataset. Mean error and mean cputime (in seconds) 
for different fractions, p, of the dataset used as training set. 



RVM 

RB-RVM 

SD-RVM 

p 

Error 

Cputime 

Error 

Cputime 

Error 

Cputime 

0.3 

1.24 

0.18 

0.43 

0.60 

0.38 

0.15 

0.4 

1.26 

0.29 

0.39 

1.25 

0.35 

0.25 

0.5 

1.27 

0.42 

0.39 

2.20 

0.36 

0.38 

0.6 

1.28 

0.60 

0.41 

3.28 

0.37 

0.53 

0.7 

1.28 

0.92 

0.45 

5.27 

0.43 

0.80 
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Fig. 8. PSNR vs. percentage of salt and pepper noise (p) 
averaged over 7 images with 10 noise realizations for each 
image for each value of p. 


the average customer, x can therefore be modelled as a sparse 
vector. 

We used a fraction p of the dataset as training data and 
the rest as test set. By choosing the training set uniformly 
at random we evaluated the mean absolute error of the pre¬ 
dicted median and mean cputime (in seconds) over 1000 real¬ 
izations. 

We found that SD-RVM gave 10% to 5% lower mean er¬ 
ror than that of RB-RVM and the mean etTor of RB-RVM and 
SD-RVM was about 70% lower than the error of the RVM 
(see Table [^. The cputime of SD-RVM was 16% to 25% of 
the cputime of RB-RVM. 

4.4. Image denoising 

A grayscale image (represented in double-precision) can be 
modelled as an array of numbers in the interval from 0 to 
1. Common sources of noise in images are electronic noise 
in sensors and bit quantization errors. Salt and Pepper m 
noise makes some pixies black (0) or white (1). To test the 
algorithms we added p percent of salt and pepper noise in 
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Fig. 9. One realization of salt and pepper noise denoising for p — 0.2. Columns from left to right: Noisy image, median filter, 
RB-RVM and SD-RVM. Peak Signal to Noise Ratio (PSNR) has been rounded to two decimals. 


7 different images (Boat, Baboon, Barbara, Elaine, House, 
Lena and Peppers) and denoised them using the median biter, 
the RVM, the RB-RVM and the SD-RVM. The pixels were 
set to either black or white with equal probability. 


vectorization operator 1291 . i.e. 




The median biter estimates the value of each pixel by the 
median in a 3 x 3 square patch. For the RVM, RB-RVM and 
SD-RVM, the value of a pixel was estimated by forming a 
5x5 square patch around the pixel. In the patch, the pixels 
were modeled as ll29l 


y, = j3Q + (3l (x - Xj) -f/3jvech((x - Xi)(x - x^)^) -f m, 


Given the regression parameters, the value of the centtal pixel 
is estimated as y = /3o. Since pixels close to the central 
pixel are more important, the errors are weighted by a ker¬ 
nel, iT(x, Xi). The estimation problem thus becomes 


p 

a XI k* - /3o - /37(x - X,) 

P0^Pl-,P2 . T ' 

2 = 1 


—/3^vech((x — Xi)(x 


x*)^) 


2 

iT(x,x,), 


where rii is noise, x is the position of the central pixel, x^ is 
the position of pixel i, i = 1, 2,..., 25 and vech is the half- 


where we used the kernel 

iT(x,Xj) = exp (-||x-Xi|| 2 /r^) (l-fx^x,)^, 

















Table 2. Mean cputime (in seconds) for denoising images cor¬ 
rupted by salt and pepper noise averaged over 7 images. 


For fixed precisions 7 and (3, the Maximum A Posteriori 
(MAP) estimate of x becomes 


Algorithm 

Mean cputime 

X = argmax logp(y,: 

Median filter 

5 


RVM 

925 

= arg min (y — Ax 

X 

RB-RVM 

1154 

= SA^By, 

SD-RVM 

788 



the kernel is a composition of a Gaussian and polynomial ker¬ 
nel 1611^. In the simulation we used r = 2.1 and p = 1 as 
in 0 . To avoid overfitting, it is beneficial to promote sparsity 
in [Po, f3j, f32V Mmm- 

We compared the algorithms by computing the Peak Sig¬ 
nal to Noise Ratio (PSNR) 


PSNR = -10 • logio 


( f;[||x-x||^] \ 

\E[maxij \Xij\‘^]{pq) j 


where S = (F + B A) ^. The form of the MAP estimate 

is the same for all models considered in this paper. 

6.1. Derivation of ( fTT] i and ( fT5| i 
Proof of ([TTJ. Since 

n 

B-i + Ar-^A^ = B -1 + ^ 

i=l 

where is the i’th column vector of A we find that 


where the size of the image is p x 5 and the expectation is 
taken over the different images and realizations of the noise. 
All images in the simulation were of size p = q = 256 with 
maxi j = 1. Figure 1^ shows one realization of the 

problem, where SD-RVM gives lower PSNR than the median 
filter and RB-RVM. In the simulations we varied p and used 
10 noise realizations for each image. The result is shown in 
figure]^ 

We found that the median filter performed best for p < 
0.1 while SD-RVM outperformed RB-RVM for all values of 
p and also the median filter for p > 0.2. The gain in using 
SD-RVM over the median filter was significant for the images 
Boat, Elaine, Lena, House and Peppers (for p > 0.2). The 
mean cputime of SD-RVM was 68 % of the mean cputime of 
RB-RVM (see Table 4.4 1 while the median filter was by far 
the fastest method. 


^(yT(B-i+Ar-iAT)-iy) 

= 7 -^ (a7(B-l+A^-lA^)-V)^ 

Using that 

r^iA^(B-i + Ar^iA^)-V (27) 

= r-^A^ (B - BA(r + A^BA)-iA^B) y 

= F'^A^By - F'l ;^B_A (F + A^BA)-i A^By 

=s-i-r 

= SA^By = X, 

we get that 

^ (y^(B-i + AF-1 aT)-V) = 7”" 


5. CONCLUSION 

In this paper we introduced the combined Sparse and Dense 
noise Revelance Vector Machine (SD-RVM) which is robust 
to sparse and dense additive noise. SD-RVM was shown to 
be equivalent to the minimization of a non-symmetric sparsity 
promoting cost function. Through simulations, SD-RVM was 
shown to empirically perform better than the standard RVM 
and the robust RB-RVM. 


6. APPENDIX: DERIVATION OF UPDATE 
EQUATIONS 

Here we derive the update equations for x, 7 ^ and f3j for the 
SD-RVM in Sections [n]|3ll and[l2l 


□ 

Proof of CD. Since 

yT(B'i + AF"^A^)-V 

= y^By - yBA (F -f A^By, 

we get that 

^(y^(B~i-|-AF-iAT)-V) 

= y] - 2yj Aj-:SA^By -b y^BASAj^^A^-^SA^By 
= y] - 2 % Aj- :X -b (Aj- :X)2 = [y - Ax]|. 

□ 









6.2. Known block structure 

Let r and B be diagonal matrices with 

[r]fcfc = 7j, if /c e h, [B]n = Pj, if I G Jj, 
and zero otherwise. 

To update the precisions we maximize the marginal dis¬ 
tribution 

p(y,7,B) =p(y|7./3M7M/3)> 

with respect to 7 and (3, where ^( 7 ) and p{j3) is as in ([^ and 
( [T^ . The log-likelihood of the parameters is 

C = const. — i log det(B^^ + AF”^ A^) 

-iyT(B-i+Ar-iAT)-iy 

P Q 

+ ^(alog7 - 67*) + ^(clog^j- - dPj). 

i=i j=i 

Using we get that £ is maximized when 
dC 1 . rii a , 

— = --tr(S,J + ^ + --& 

07* 2 27* 7j 


1 

2^; 


2 mA;,(B-i + AF-'A ' )-Vll 2 = 0 , (28) 


where S/. G is the submatrix of S consisting of the 

columns and rows in £. Further, using ( |27] ) we get that 

AT(B^i+AF-1A^)-V = 7»x/.. (29) 

Thus, ( [28] l is fulhlled when 

^ ^ -b- h\xu\\l = 0 . 

2 27j 7j 2 

As before, instead of solving for 7 ^ we rewrite the equation as 

n, - 7 .tr(S,J + 2a - (||x,,||2 + 26 ) 7 ^"' = 0. (30) 

Solving ( [ 30 I 1 for 7 '*®“’ gives us the update equation ( |20l i. 

To hnd the update equation for Pj we use that 


^[y^(B- + AF-A^)-V]=||y.,||^ 


_9 

- 2yT ASATBy + y^BASA^. ASA^By 

= ||(y-Ax)^J| 2 , 

where A / . consists of the row vectors of A which row num- 
ber belongs to Jj. We get that 

dPj 2 ^ 2Pj 

- ^Il(y-Ax)jj|^ + -^-d = o. 


Rewriting the equation as 

1 - /3jtr(Aj^_:SA]._.) -b 2 c 
-(||(y-Ax)^J|2 + 2d)/?—=0, 

and using that tr(Aj^.:SAj, .) = tr([ASA^]j^) gives us 
the update equation ( | 2 T| ). 

6.3. Unknown block structure 

When the block structure is unknown, we use the over¬ 


parametrized model in section 3.2 The log-likelihood of 
the parameters is 

£ = logp(y| 7 ,/ 3 )p( 7 )p(/ 3 ) 

= const. - i log det(B^i TAF-^A^) (31) 

-iyT(B-i + AF-iAT)-iy 

P Q 

+ + ^{c^ogPj - dPj). 

i=l j=l 

We search to maximize pT] ) with respect to the underlying 

variables pk and Pi. Using that - 7 fc ^ 

when i G Ik and zero otherwise, ( fl^ and p7| ) we find that £ 

is maximized when 


=-t4tr(SF2) + 42tr(Ffe) 

k 


djk 27^*^“*"^ ' 2 f 

1 -T-ri 2 - « 


- 772 ^ BfcX + y- b = 0, 


(32) 


27fc ■■ Ik 

By rewriting ( |3^ as 

^tr(Ffc) - ;(*^tr(FfcSFfc) -b 2a 
7fc Ik 

- (^^||Ffcx||2 + 26)7r“ = 0. (33) 

Solving ( |3^ for 7 | 1 ®“' gives us the update equation ( |2^ . 

For the noise precisions, we similarly hnd that 

= -777tr(B;A^SAB0 + 


dPi 2pf 


1 . . .,,,9 C 


27 


-^||B,(y-Ax)|b + j-d = 0, 

By rewriting the expression as 

-^tr(B/) —^tr(B;A^SAB;) -b 2c 
A Pi 


Pi 


^,,Biiy-ASc)\\i + 2d\pr^ = 0, 




we find the update equation ( |24| l. 

We see that the form of update equations depends on how 
the equations are rewritten. The form used here has the ad¬ 
vantage of reducing to when the underlying blocks are 
disjoint. 
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